United States Patent [19]

Emma et al 



[54] MULTI-PREDICTION BRANCH 
PREDICTION MECHANISM

[75] Inventors: Philip G. Emma, Danbury, Conn.; 

Joshua W. Knight, Mohegan Lake; 
James H. Pomerene, Chappaqua,
both of N.Y.; Thomas R. Puzak,
Ridgefield, Conn.; Rudolph N.
Rechtschaffen, Scarsdale; James R. 
Robinson, Clinton Corners, both of 
N.Y. 

[73] Assignee: International Business Machines 
Corporation, Armonk, N.Y. 

[21] Appl. No.: 91,416 

[22] Filed: Jul. 13, 1993 

Related U.S. Application Data 

[63] Continuation of Ser. No. 594,529, Oct. 9, 1990, abandoned.

[51] Int. Cl.5 G06F 9/38
[52] U.S. Cl. 395/375; 364/261.3; 364/261.5; 364/263.1; 364/255.7; 364/231.8; 364/DIG. 1




US005353421A 

[11] Patent Number: 5,353,421 
[45] Date of Patent: Oct. 4, 1994 



[58] Field of Search 395/375, 800 

[56] References Cited 

U.S. PATENT DOCUMENTS 

4,853,840 8/1989 Shibuya 395/375 

4,894,772 1/1990 Langendrof 395/375 

4,943,908 7/1990 Emma et al 395/375 

5,121,473 5/1992 Hodges 395/375 

5,142,634 8/1992 Fite et al. 395/375

Primary Examiner — P. S. Lall 



Assistant Examiner — Ayni Mohamed 

Attorney, Agent, or Firm — Whitman, Curtis, Whitman & 

McGinn . 

[57] ABSTRACT 

A multi-prediction branch prediction mechanism pre- 
dicts each conditional branch at least twice, first during 
the instruction-fetch phase of the pipeline and then 
again during the decode phase of the pipeline. The 
mechanism uses at least two different branch prediction 
mechanisms, each a separate and independent mechanism
from the other. A set of rules is used to resolve
those instances in which the predictions differ.

MULTI-PREDICTION BRANCH PREDICTION MECHANISM

This is a continuation of application Ser. No. 07/594,529
filed Oct. 9, 1990 and now abandoned.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention generally relates to the field of
data processing and, more particularly, to predicting
the outcome of conditional branches, either taken or not
taken, in the processor of a computer.

2. Description of the Prior Art

In most high performance processors, pipelining is
used as a means to improve performance. Pipelining
allows a processor to be divided into separate components
where each component is responsible for completing
a different phase of an instruction's execution. For
example, FIG. 1 shows the major components that
make up a processor's pipeline. The components are:
instruction fetch (stage I), instruction decode and address
generation (stage II), operand fetch (stage III),
instruction execution (stage IV), and put away of the
results (stage V). Each instruction enters the pipeline
and ideally spends one cycle at each stage of the pipeline.
Individually, each instruction takes five cycles to
pass through the pipeline. However, if the pipeline can
be kept full then each component of the processor (pipeline
stage) can be kept actively working on a different
instruction, each at a different pipeline stage, and one
instruction can complete in every cycle. Unfortunately,
keeping the pipeline full is a difficult task. Breaks in the
pipeline, disruptions, frequently occur and result in idle
cycles that can delay an instruction's execution.

The branch instruction is one of the major causes of a
pipeline disruption. The branch instruction introduces a
temporary uncertainty into the pipeline because, in
order to keep the pipeline full, the processor must guess
which one of two possible instructions enters the pipeline
next: the fall through instruction or the target of the
branch. Most high performance processors will guess
the outcome of the branch before it executes and then
proceed to fetch and decode instructions down the path
that is guessed (either taken or not taken).

By attempting to predict the outcome of the branch,
the processor can keep the pipeline full of instructions
and, if the outcome of the branch is guessed correctly,
avoid a pipeline disruption. If the branch was guessed
incorrectly, for example a guess of not taken and the
branch is actually taken, then any of the instructions
that entered the pipeline following the branch are canceled
and the pipeline restarts at the correct instruction.

Several patents are directed to branch prediction
mechanisms, each having certain advantages and disadvantages.
Many are based on the observation that most
branches are consistently either taken or not taken, and
if treated individually, consistently branch to the same
target-address. For example, U.S. Pat. No. 4,477,872 to
Losq et al. describes a mechanism by which each conditional
branch is predicted based on the previous performance
of the actions. A table is maintained that records
the actions of each conditional branch, either taken or
not taken. Each entry of the table consists of a one bit
value, either a one or zero, indicating if the branch is
taken or not taken, respectively. The table is accessed,
using a subset of the address bits that make up the
branch, each time a conditional branch is decoded. The
table is referred to as a Decode History Table (DHT),
and combinatorial logic determines the guess from the
value found in the table. No attempt is made to predict
the branch target, since this is known at decode time;
just the outcome of the branch is predicted. The DHT
is used to predict the outcome of only the conditional
branches, since the action of each unconditional
branch is explicitly known once it is decoded.

U.S. Pat. No. 3,325,785 to Stephens describes a mechanism
by which the outcome of a branch is predicted
based on the type of branch and statistical experience as
to whether the branch will be taken. Another branch
strategy describes suspending the pipeline until the
branch is fully executed. The outcome of the branch is
then known, either taken or not taken, and the correct
instruction can then be fetched and processed through
the pipeline. This strategy, however, results in several
cycles of pipeline delay (idle cycles) per branch.

U.S. Pat. No. 4,181,942 to Forster et al. describes a
mechanism by which a special branch instruction is
used in a processor to indicate the type of branch, either
conditional or unconditional as determined by the state
of an internal register. The special branch instruction is
used for program control at the end of a program loop
and for unconditional branching outside of the loop.

U.S. Pat. No. 4,200,927 to Hughes et al. describes a
mechanism by which multiple instruction buffers are
addressed and filled based on the prediction of each
branch that is encountered in the instruction stream.
The prefetching of instructions into each instruction
buffer and the selection of one of the instruction buffers
for gating instructions into the decoder is controlled by
logic which keeps track of the status of each instruction
stream and branch contained in each instruction buffer.
Branches are guessed based on their type, and result
signals from the instruction execution unit, in response
to the execution of conditional branch instructions, will
control the setting of various pointers to allocate new
instruction streams to instruction buffers and to de-allocate
or reset the instruction streams based on the results
of branch execution.

A more effective strategy is described in U.S. Pat.
No. 3,559,183 to Sussenguth. This patent describes a
mechanism that records in a table the address of a set of
recently executed branches followed by their target-address.
This table is referred to as a Branch History
Table (BHT). An entry is made for each taken branch
that is encountered by the processor, both conditional
and unconditional. The table (BHT) is accessed during
the instruction-fetch (I-fetch) phase of the pipeline
(stage I of FIG. 1). This allows the BHT to predict the
outcome of a branch even before the branch instruction
has been decoded. Each instruction fetch made by the
processor is compared against each branch address
saved in the BHT and, if a match occurs, then a branch
is assumed to be taken and the target-address, also in the
table, becomes the next instruction-fetch address. In
principle, each instruction fetch address found in the
table is predicting that a branch instruction will be
found at that address and that the branch will be taken
to the same address as specified by the target-address
saved in the BHT. If no entry is found in the BHT, then
it is assumed that there is not a branch within the instruction-fetch
address (address of the instruction doubleword
that is being fetched) or, if there is a branch, it
is not taken. By accessing the BHT during the instruction-fetching
phase of the pipeline, an attempt is made
to find each taken branch as early as possible and fetch
the target-address even before the branch instruction
address is decoded. Ideally, this will avoid any pipeline
delay caused by taken branches in a pipelined processor.
Typically, if a processor waits until a branch is decoded
before fetching its target then a pipeline disruption will
occur (for each taken branch) because it may take several
cycles to fetch the target of the branch from either
the cache or memory. By fetching the target of the
branch even before the branch is decoded, the BHT
offers a significant performance improvement over the
previously mentioned branch prediction mechanisms.
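
By way of illustration only, the fetch-time lookup of a BHT may be sketched as follows. This is a minimal software model, not the hardware of any cited patent: the dictionary standing in for the associative directory, the name SEGMENT_BYTES, and the 8-byte fetch segment are assumptions made for the example.

```python
# Minimal sketch of a fetch-time Branch History Table (BHT).
# SEGMENT_BYTES and the dict-based directory are illustrative assumptions.
SEGMENT_BYTES = 8  # one doubleword per instruction fetch (assumed)


class BranchHistoryTable:
    def __init__(self):
        # Maps the fetch address of a taken branch to its target address.
        self.entries = {}

    def record_taken(self, branch_addr, target_addr):
        """Remember a taken branch (conditional or unconditional)."""
        self.entries[branch_addr] = target_addr

    def next_fetch(self, fetch_addr):
        """Return (hit, next_fetch_addr) for one instruction fetch."""
        if fetch_addr in self.entries:
            # Hit: predict "taken" and fetch from the saved target.
            return True, self.entries[fetch_addr]
        # Miss: assume no taken branch and fetch the next sequential segment.
        return False, fetch_addr + SEGMENT_BYTES
```

A hit redirects the next fetch to the saved target before the branch has been decoded, while a miss simply continues down the sequential path.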

U.S. Pat. No. 4,679,141 to Pomerene et al. describes a 
branch prediction mechanism that improves the BHT as 
described in U.S. Pat. No. 3,559,183. The BHT is im- 
proved by dividing it into two parts; an active area and 
a backup area. The active area contains entries for a 
small subset of branches which the processor has en- 
countered and the backup area contains all of the other 
branch entries. Mechanisms are described to bring
entries from the backup area into the active area ahead of
when the processor will use those entries. The small size 
of the active area allows it to be fast and optimally 
placed in the processor's physical layout. 

The prior art patents described above can be divided
into two categories: those that make a prediction for a
branch at instruction-fetch-time and those that make
their prediction at decode-time. In the patents to
Hughes et al., Forster et al., Stephens, and Losq et al.,
each branch is discovered during the decode phase of
the pipeline (stage II of FIG. 1) and a guess is then
provided. For this reason, only conditional branches
need to be guessed or predicted by the DHT since the
branching certainty of all unconditional branches is
known after decode time. These patents will be referred
to as decode-time branch-prediction mechanisms. Note
that none of these prediction mechanisms attempts to
guess the target of the branch since this is precisely
known when the branch is decoded. In contrast, the
patents to Sussenguth and Pomerene et al. describe
making a branch-prediction guess during the instruction-fetch
phase of the pipeline (stage I of FIG. 1) and
in doing so must predict the outcome for all taken
branches, both conditional and unconditional, and predict
the target of each taken branch as well. These
patents will be referred to as the instruction-fetch-time
branch-prediction mechanisms. Each of the instruction-fetch-time
branch-prediction mechanisms represents
significantly more hardware than the decode-time
branch-prediction mechanisms, but they also offer improved
performance to warrant their implementation.
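
For explanatory purposes, a decode-time mechanism of the DHT type may be sketched as follows. The sketch assumes a table of one-bit entries indexed by the low-order bits of the branch address; the class and function names, the modulo index, and the initial "not taken" state of every entry are assumptions of the example rather than details taken from the cited patents.

```python
# Minimal sketch of a decode-time Decode History Table (DHT).
# The table size, modulo index and initial state are illustrative assumptions.
class DecodeHistoryTable:
    def __init__(self, entries=4096):
        self.entries = entries
        self.bits = [0] * entries  # 1 = taken, 0 = not taken

    def _index(self, branch_addr):
        # Use a subset of the branch address bits to select an entry.
        return branch_addr % self.entries

    def predict(self, branch_addr):
        """Always returns a guess; there is no hit or miss."""
        return self.bits[self._index(branch_addr)] == 1

    def update(self, branch_addr, taken):
        """Record the latest outcome of a conditional branch."""
        self.bits[self._index(branch_addr)] = 1 if taken else 0


def decode_time_guess(dht, branch_addr, is_conditional):
    # An unconditional branch is known to be taken once it is decoded.
    if not is_conditional:
        return True
    return dht.predict(branch_addr)
```

Because the table is indexed by only part of the address, two branches can share an entry; the table nevertheless always produces a prediction, and no target address needs to be stored.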

For explanatory purposes a brief comparison of the 
amount of hardware needed to implement a BHT and 
DHT is now presented. We begin by comparing the 
amount of hardware in each table used by the BHT and 
DHT. Each entry of a BHT consists of two addresses, 
the branch address followed by the predicted target- 
address whereas each entry in a DHT is represented by 
a single bit, indicating if the branch is taken or not taken. 
Thus, if each address in a BHT is represented as 32 bits,
then a BHT with 1K entries consists of 1024 two-address
pairs (i.e., 1024 x 64 bits), where each entry is
represented by 64 bits. Then, when comparing the relative
size of each mechanism we see that a BHT that
consists of 1K entries is actually 64 times larger than a
DHT that consists of 1K entries.
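
The comparison above can be checked with a short calculation. The function names are hypothetical, and the estimate counts table bits only, as the text does.

```python
# Storage estimate from the text: a BHT entry holds two addresses,
# a DHT entry holds one bit. ADDRESS_BITS = 32 follows the example above.
ADDRESS_BITS = 32


def bht_bits(entries, address_bits=ADDRESS_BITS):
    """Bits in a BHT: branch address plus target address per entry."""
    return entries * 2 * address_bits


def dht_bits(entries):
    """Bits in a DHT: a single taken/not-taken bit per entry."""
    return entries


ratio = bht_bits(1024) // dht_bits(1024)
print(bht_bits(1024), dht_bits(1024), ratio)  # 65536 1024 64
```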



SUMMARY OF THE INVENTION 

It is therefore an object of the present invention to 
provide a mechanism that predicts the outcome of each 
branch a plurality of times and resolves the predictions 
when they do not agree. 

It is another more specific object of the invention to 
provide a multiple-prediction branch prediction mecha- 
nism for predicting the outcome of conditional branches 
in a computer which combines the advantages of a
branch history table with those of a decode history table to
attain an increase in performance in the computer while 
at the same time minimizing the hardware overhead 
usually associated with increased performance. 

According to the present invention, referred to as 
multi-prediction branch prediction, a branch prediction 
mechanism predicts each branch at least twice, first
during the instruction-fetch phase of the pipeline and 
then again during the decode phase of the pipeline. The 
mechanism uses two different branch prediction mecha- 
nisms, each a separate and independent mechanism from 
the other. A set of rules then determines the prediction
for the branch whenever the two separate branch
prediction mechanisms disagree. The preferred choice
for the instruction-fetch-time prediction mechanism is 
the BHT and the preferred decode-time prediction 
mechanism is the DHT. Each mechanism is preferred 
because, independently, the prediction accuracy of each 
mechanism is very high and, when combined, even 
higher guess accuracy rates are obtained. For example, 
it is not uncommon for a BHT or DHT comprised of 4K 
entries to successfully predict the outcome of over 80% 
of the branches encountered by the processor. How- 
ever, other decode-time or instruction-fetch mecha- 
nisms can be substituted as desired. 

It was not immediately apparent that two or more 
branch prediction mechanisms could co-exist in a pipe- 
lined processor. What makes the concept work is the 
use of a set of rules to resolve those instances where the 
predictions differ. The advantage of the invention is 
that a higher correct prediction rate is achieved than 
using a single prediction mechanism alone, and this is 
accomplished with a lower hardware overhead. 

BRIEF DESCRIPTION OF THE DRAWINGS 

The foregoing and other objects, aspects and advantages
will be better understood from the following detailed
description of a preferred embodiment of the
invention with reference to the drawings, in which:

FIG. 1 is a generalized block diagram of the pipeline 
stages in a high performance computer; 

FIG. 2 is a block diagram of a prior art processor 
with a branch history table; 

FIG. 3 is a flow diagram illustrating the operations 
associated with the branch history table in the processor 
shown in FIG. 2; 

FIG. 4 is a block diagram of a prior art processor 
with a decode history table; 

FIG. 5 is a block diagram of the sequential prefetch- 
ing logic control in the processor shown in FIG. 4; 

FIG. 6 is a block diagram of the decode history table 
in the processor shown in FIG. 4; 

FIGS. 7A and 7B are tables summarizing, respec- 
tively, the branch history table and the decode history 
table instruction-fetch decode algorithms for the pro- 
cessors shown in FIGS. 2 and 4; 

FIG. 8 is a table summarizing the integrated branch
history and decode history table instruction-fetch and
decode-time algorithms according to the preferred embodiment
of the invention;

FIG. 9 is a table illustrating the branch history table 
directory and segment entry information according to 
the invention;

FIG. 10 is a block diagram of a processor with both 
a branch history table and a decode history table ac- 
cording to the preferred embodiment of the invention; 

FIG. 11 is a block diagram illustrating the branch 
address/target address selection mechanism used in the
preferred embodiment of the invention; 

FIG. 12 is a flow diagram showing the branch history
table and decode history table logic; and 

FIG. 13 is a detailed logic diagram of the branch 
history table and decode history table prediction mech- 
anism. 

DETAILED DESCRIPTION OF A PREFERRED 
EMBODIMENT OF THE INVENTION 

The operations of a processor with a standard BHT 
or a DHT are described as a prelude to describing the 
operation of the present invention with multi-prediction 
branch prediction. As described above, the BHT must 
be able to predict a high percentage of its branches 
correctly because there is a severe penalty when it is 
wrong. Normally, many of the prediction errors made 
by the BHT are discovered only after the branch is
executed. Several cycles of delay can be saved if each
error can be detected earlier. Remember that the BHT
makes its prediction during the instruction-fetch phase
of the pipeline and that it can be several cycles until the
branch is finally executed. By using the DHT to re-predict
the outcome of each conditional branch encountered
by the processor it is possible to detect a potential
prediction error made by the BHT during the decode 
phase of the pipeline. The processor can then be redi- 
rected to the correct instruction stream thus saving 
several cycles of delay. 

The prediction process is now explained. The BHT is
used much in the manner described above while the
DHT is used to confirm the branch predictions made by
the BHT and, in certain cases when the predictions
differ, override the prediction made by the BHT. Obviously,
when both predictions are in agreement, either
both taken or both not taken, there is no need to override
any branch prediction. However, when the predictions
for a branch differ, then a decision as to the outcome
of the branch, either taken or not taken, must be
made. For example, consider the case where the initial
prediction made by the BHT is that the branch is "not
taken" and that the DHT predicts that the branch will
be "taken".

That is, let an instruction fetch made by the processor
miss all "branch address" entries contained in the BHT.
This suggests that there is not a taken branch contained
within the datum accessed for instructions. This will
direct any future instruction fetches to be made to the
"fall through" path. However, let the DECODER
discover a conditional branch within this datum and let
the DHT indicate by its entry that the branch is taken.
This event indicates that the "branch taken path" must
be fetched, as predicted by the DHT. For this sequence
of events, we will assume that the BHT has incorrectly
predicted the outcome of the branch and that the prediction
made by the DHT is correct. This will then direct
the DECODER to decode instructions from the "predicted
target-address" path.
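
The sequence of events just described may be sketched as a small decision function. It covers only the case in which the BHT has predicted "not taken" (a BHT miss); the function name, the string labels, and the return values are hypothetical, and the handling of other combinations of predictions is not modeled here.

```python
def resolve_on_bht_miss(decoded_branch, dht_says_taken):
    """Decide the path to follow when the BHT predicted "not taken".

    decoded_branch: None, "conditional", or "unconditional" (hypothetical labels).
    dht_says_taken: DHT prediction, used only for conditional branches.
    Returns "fall_through" or "target".
    """
    if decoded_branch is None:
        return "fall_through"   # no branch found, the BHT miss stands
    if decoded_branch == "unconditional":
        return "target"         # the decoder knows the branch is taken
    # Conditional branch: the DHT prediction overrides the BHT miss.
    return "target" if dht_says_taken else "fall_through"
```

When the two predictions agree (the DHT also says "not taken"), the function leaves the fall-through path in place, so an override occurs only on disagreement.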



For the example given above, there are several reasons
to allow the prediction made by the DHT to override
the prediction made by the BHT:

(a) A much larger DHT, in number of entries, can be 
used with a small BHT and still be much smaller 
than the overall size of the BHT. This allows the 
DHT to remember the branch results from many 
more branches than a BHT and still be significantly 
smaller than the BHT. For example, a DHT with 
4K entries is still only one sixteenth the size of a 1K
entry BHT (4096 bits/(1024 x 64) bits).

(b) The BHT functions as a depository for all "taken 
branch" information and, because of its fixed size 
limitation, suffers from the same hit and miss statis- 
tics as does the cache. Recall that the instruction- 
fetch address must match one of the branch ad- 
dresses contained in the BHT to be a hit, whereas 
the DHT is accessed by using a subset of the address
bits of the branch and always makes a prediction.

(c) The BHT must record all taken branches, both 
conditional and unconditional, whereas the DHT 
must only record the outcome of conditional 
branches. Recall, all unconditional branches can be 
correctly predicted by the DECODER and need
not have their branch results stored as part of the 
DHT information. Thus, a BHT has many more 
distinct branches to predict than does a DHT. This 
aggravates the already "cache like" miss phenom- 
ena associated with a finite BHT directory. 

(d) All "new" (first time) taken branches will be a 
miss to the BHT but can still be contained in the 
DHT because of its larger size. These "new" 
branches may represent the initial execution of a 
branch or the re-execution of a branch that has 
aged out of the BHT because of its finite size. 
These conditions allow a DHT to correctly redi- 
rect the DECODER on conditional branches that 
are actually taken and have been predicted as "not 
taken" by the BHT. 

By predicting each branch twice, using a BHT at 
instruction fetch time and a DHT at decode time, it is 
possible to actually increase performance of the proces- 
sor while decreasing the overall hardware of the ma- 
chine. For example, a DHT with 16K entries working 
in conjunction with a BHT with 1K entries represents
significantly less hardware than a BHT with 2K entries.
Also, the 1K BHT and 16K DHT can predict a greater
percentage of the branches correctly and thus provide
an increase in performance over a larger BHT.
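
The hardware claim in this example can be checked with simple arithmetic, assuming 64 bits per BHT entry (two 32-bit addresses) and one bit per DHT entry as in the earlier size comparison. Only table storage is counted; compare logic and control circuitry are ignored in this estimate.

```python
K = 1024
BHT_ENTRY_BITS = 64  # two 32-bit addresses per entry (assumed, as in the earlier example)
DHT_ENTRY_BITS = 1   # one taken/not-taken bit per entry

# 1K-entry BHT working together with a 16K-entry DHT.
combined_bits = 1 * K * BHT_ENTRY_BITS + 16 * K * DHT_ENTRY_BITS
# A 2K-entry BHT on its own.
larger_bht_bits = 2 * K * BHT_ENTRY_BITS

print(combined_bits, larger_bht_bits)  # 81920 131072
```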

The description presented with reference to the 
drawings is basic, and it will be understood by those 
skilled in the art that many of the actual features that 
make up a processor's design have been simplified or 
omitted. For example, fully associative directories are 
chosen for the BHT rather than more conventional set 
associative lookups that may be required in an actual 
implementation. A more detailed description of a BHT 
can be found in U.S. Pat. Nos. 3,559,183 and 4,679,141, 
and a more detailed description of a DHT can be found 
in U.S. Pat. No. 4,477,872.

In the drawings, like reference numerals represent 
identical or similar parts. Referring now to the draw- 
ings, FIG. 2 shows the essential features of a processor 
that has a BHT. The actions of each component are 
described in accordance with each pipeline phase 
shown in FIG. 1. Each phase is described separately;
however, as in any pipeline processor, all phases of the
pipeline occur in parallel.

Instructions are stored in memory 10 and are fetched 
from the memory 10 and stored in cache 13, according 
to well known mechanisms. During the instruction-fetch
phase of the pipeline, the instruction-buffer 11
signals the BHT 12 via the "instruction-buffers not full"
signal that space is available for another instruction 
fetch. The BHT then generates the next-instruction- 
fetch and sends the address to the cache 13. The I-fetch 
segment is then returned from the cache via path 14. 

During the decode phase, the BHT 12 signals the 
instruction-buffer 11 via path 15 to load the next- 
instruction register 16 with the appropriate next instruc- 
tion. The instruction loaded can be either the instruc- 
tion immediately following the previous instruction 
loaded, i.e., the "next-sequential instruction", or a 
branch-target instruction, depending on the information
sent from the BHT. 

The instruction decode register 17 is loaded from the 
next-instruction register 16. The instruction is decoded 
and operation code (op-code) and execution informa- 
tion is assembled. After decoding the instruction, the 
execution information is sent to the execution unit 18 
where it is held until execution. If operands are required 
by the instruction, then the necessary information (base 
register values, index register, and displacements) are 
sent to the address generate mechanism 19. The output 
of the address-generate mechanism 19 is an address for 
an operand. The address is then sent to the cache 13. 
The cache then sends the operand information back to 
the execution unit 18 via path 21. 

The instruction waits at the execution unit 18 until the 
needed data is available from the cache 13. There, the 
instruction is executed and the results are put away as 
required. If a branch was executed, then the BHT up- 
date information is sent from the execution unit 18 back 
to the BHT 12 via path 22. The BHT update informa- 
tion is sent to assure that the BHT is correctly predict- 
ing the outcome of each branch. Finally, the Endop 
(end operation) function 23 is issued to mark the point in 
time when the instruction is successfully completed. 

In parallel with this action, the instruction-length 
information, generated by the decode mechanism 17, is 
sent to the address adder 25. There it is combined with 
the value from the instruction-counter (IC) register 26 
to form the next-instruction address. The output from 
the address adder 25 is then sent to the update-instruc- 
tion-counter register 27 and to the BHT 12 via line 28. 
The update-instruction-counter register 27 will then 
hold the value of the instruction to be decoded on the 
next cycle. This sequence of events is then repeated by 
the processor for the next cycle. 

A more detailed description of a BHT is shown in 
FIG. 3. Essentially, the BHT is used to effectively di- 
rect the instruction fetching policies of a processor and 
improve branch prediction accuracy by directing the 
right instruction, either the next sequential or branch 
target, to the decoder. To do this, it must examine the 
address of each instruction fetch made by the processor 
and detect when there is a taken branch contained 
within an instruction-fetch segment. Each instruction 
fetch is held in the instruction-fetch address-register 31. 
This address is compared by compare function 33 
against each branch address held in the BHT directory 
32. Usually each instruction-fetch segment contains 
more than one instruction. In present machines, an instruction-fetch
segment is several words long (i.e., a
double word of 8 bytes or quad word of 16 bytes) thus
allowing the possibility for several instructions to exist
within each instruction-fetch segment. Any instruction-fetch
address that matches a branch address contained
in the BHT will be termed a "BHT-hit". Similarly, a
"BHT-miss" is used to denote an instruction-fetch segment
that misses all of the branch addresses contained in
the BHT. That is, the address of the instruction fetch
fails to compare with any of the instruction fetch segments
held in the BHT directory.

If an instruction-fetch misses all BHT entries, then
the next instruction-fetch address made by the processor
is the "next sequential" instruction-fetch segment, as
determined by next sequential logic 34. If the instruction-fetch
address "hits in the BHT", then the processor
will switch instruction streams, and the next-instruction
address segment generated by the processor is the target-address
of the BHT entry that caused the "hit". The
new target-address goes to gate 35 and becomes the
new instruction-fetch address when a BHT-hit is detected.
The next-sequential instruction-fetch is called
for when a BHT-miss is detected via gate 36.

If there is an address-match, a hit, then the branch-address
(BA) and target-address (TA) are saved in the
branch-address, target-address (BA/TA) stack 37, and
the target-address becomes the next instruction-fetch
address on the following cycle. The BA/TA stack is
used to guide the pre-loading of instructions from the
instruction-buffer to the next-instruction-register. The
information to load the instruction-buffer is sent via
path 15. The information, an address, used to load the
instruction buffer is obtained by comparing the branch-address
of the oldest entry in the BA/TA stack 37 to the
next-instruction-address sent via path 28. If these addresses
are equal as determined in function block 41,
then the next-instruction-address should become the
target-address for the BA/TA pair in the stack 37.
Note that this will happen if the BHT is correct in
predicting the action of the branch. The target-address,
for the BA/TA pair that matched, is held in function
block 42 and signals unit 43 to select this address for the
next-instruction address on path 15. If the compare in
function 41 is not equal, then no select is issued and the
next-instruction address from path 28 becomes the next-instruction
address sent via path 15. The BA/TA pair is
then removed from stack 37 and the process continues
with the next oldest BA/TA pair found on the stack.
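
The selection performed by function blocks 41 through 43 may be sketched as follows. The sketch is a software analogy under stated assumptions: the BA/TA stack is modeled as a queue with the oldest pair first, the function and parameter names are hypothetical, and a pair is assumed to be removed only after its branch-address has matched, a point the description above leaves open.

```python
from collections import deque


def select_next_instruction(ba_ta_stack, sequential_addr):
    """Pick the next-instruction address sent to the instruction buffer.

    ba_ta_stack: deque of (branch_addr, target_addr), oldest first.
    sequential_addr: next-instruction address from the address adder.
    """
    if ba_ta_stack and ba_ta_stack[0][0] == sequential_addr:
        # Addresses match: follow the predicted branch and retire the pair
        # (assumed here to happen only on a match).
        _, target_addr = ba_ta_stack.popleft()
        return target_addr
    # No match: continue with the address supplied by the adder.
    return sequential_addr
```

Each call compares only the oldest pair, so predicted branches are consumed in the order in which the BHT found them.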

The last-segment-fetch register, inside the next-sequential
fetch controls next sequential logic 34, is
loaded with the target-address value. Whenever a
"next-sequential" instruction fetch is called for, the
value saved in the last-instruction-fetch register is
bumped (increased by one unit) to the address of the
next-instruction-fetch segment and becomes the new
instruction-fetch address.

So far, the BHT description presented has assumed 
that each prediction made by the BHT is correct. However,
the BHT will, on occasion, make incorrect decisions.
Therefore, an update or correction mechanism
must exist in the BHT.

BHT prediction errors can be detected after the address-generation
phase of an instruction or after the
execution phase. BHT prediction errors that are discovered
after the branch is executed incur the full pipeline
delay. A signal to "restart the pipeline" is sent by the
execution unit to the BHT whenever it is discovered
that the outcome of a branch is different from the predicted
action, taken or not taken. The restart information
along with BHT correction information is sent via
path 22. The pipeline can be "restarted earlier" by detecting
a BHT error after the instruction decode and
AGEN phase of the pipeline. This avoids several cycles
of pipeline delay that would result by detecting a
branch prediction error in the execution unit after the
branch is executed.

It is noted that a BHT prediction error can be detected
at decode time whenever the BHT has predicted
that there is no branch, a BHT miss, and the DECODER
then decodes an unconditional branch. This
information can also be sent to the BHT via path 22.

An error in predicting the target-address of a branch
can be detected after the address-generate function 19
and is made available to the BHT 12 via path 28, as
shown in FIG. 2. Whenever a branch is decoded by the
decode function 17, the target address of the branch
(the output of the address-generate function 19) is sent
to the BHT 12 and compared against the predicted
target-addresses held in the BA/TA stack 37 in FIG. 3.
This logic is contained in function 40 of FIG. 3. Whenever
the TA from the address-generate function 19 does
not agree with the predicted TA, a correction is necessary.
The target-address saved in the BHT entry must
be changed to reflect the new, or current, target-address
of the branch instruction. Corrections are sent
via path 46 to the BHT directories 32. Pipeline restart
information is sent via path 45 and is used to reload the
I-fetch-address register 31 and purge the BA/TA stack
37. The effects of a pipeline restart are discussed below.

BHT corrections are also detected after the execution
phase of an instruction. Correction information is sent
via path 22. Here, the outcome of each branch, either
taken or not taken, is sent to the BHT and compared
against the predicted status of each branch currently
held in the BA/TA stack 37. This is done in unit 47. If
the error was a failure to predict a taken branch, then a
new entry is made in the BHT directory. The new
BA/TA information is sent to the BHT via path 46. If
the error was predicting that a branch would be taken
when it was in fact not taken, then the BA/TA entry currently
held in the BHT must be deleted. This correction
information is also sent to the BHT via path 46.
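
The two execute-time correction cases above can be sketched in a
few lines of Python. This is only an illustrative model under
stated assumptions, not the logic of unit 47: the BHT is assumed
to be a plain dictionary keyed by branch address, and the name
`correct_bht` and its parameters are hypothetical.

```python
def correct_bht(bht, branch_addr, target_addr, predicted_taken, actually_taken):
    """Apply the execute-time correction rules to a dict-based BHT model.

    Returns True when the prediction was wrong (a pipeline restart is needed).
    """
    if predicted_taken == actually_taken:
        return False                     # prediction was right, nothing to correct
    if actually_taken:
        bht[branch_addr] = target_addr   # missed a taken branch: make a new entry
    else:
        bht.pop(branch_addr, None)       # predicted taken but fell through: delete
    return True
```

For example, calling `correct_bht(bht, 0x100, 0x200, False, True)` on
an empty table adds the pair (0x100, 0x200) and reports an error.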

Once a prediction error has been detected in the
processor, the pipeline will have to be restarted. For
example, consider the case where a branch was predicted
as not taken and the branch was actually taken. The
instruction fetching mechanism of the BHT was such
that the target of the branch was never fetched and now
the pipeline is idle until the correct instruction
information is fetched and put in the instruction buffer. Once
the correct instruction is fetched from the cache, the
normal pipeline flow can continue. To accomplish this
"pipeline restart", information is sent via path 45. The
restart logic will cause the instruction-fetch-address 31
to be loaded with a new instruction address (in this case
the target of the branch). This allows the
instruction-fetching sequence to begin anew. The "pipeline restart"
procedure also causes the instruction-buffer 11 in FIG.
2 to be purged and the BA/TA stack 37 in FIG. 3 to be
emptied. The processor then waits until the correct
instruction (contained in the restart address) is fetched
from the cache and the normal pipeline flow can
continue. It is noted that there are several policies that can
be implemented to restart the pipeline. Many policies
are more complicated than the one described above.
For example, the instruction-buffer 11 may already
contain the correct instructions to restart the pipeline
and execution can proceed by fetching these
instructions from the instruction-buffer with little or no delay.
Additional hardware would be needed to accomplish
this. However, in this embodiment a simple restart
procedure is used and it is assumed that the pipeline is
restarted after each BHT error.
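
The simple restart procedure of this embodiment amounts to three
actions: reload the fetch address, purge the instruction buffer,
and empty the BA/TA stack. A minimal sketch follows. The class
`FetchState` and its attribute names are hypothetical stand-ins
for register 31, buffer 11, and stack 37, and the containers used
are assumptions made only for illustration.

```python
from collections import deque

class FetchState:
    """Toy model of the state touched by a pipeline restart."""

    def __init__(self):
        self.ifetch_address = 0        # instruction-fetch-address register 31
        self.instruction_buffer = []   # instruction-buffer 11
        self.ba_ta_stack = deque()     # BA/TA stack 37

    def restart(self, restart_address):
        # Load the new fetch address (for example, the branch target),
        # then discard everything fetched down the wrong path.
        self.ifetch_address = restart_address
        self.instruction_buffer.clear()
        self.ba_ta_stack.clear()
```

The more elaborate policies mentioned above, such as reusing
instructions already in the buffer, are deliberately not modeled.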

A description of the actions of a processor with a 
DHT is presented next. FIG. 4 shows the essential fea- 
tures of a processor with a DHT. The figure shows that 
a processor with a DHT shares many of the features 
found in a processor having a BHT. However, the main 
difference is that the DHT only predicts the outcome of 
each conditional branch, either taken or not taken, and 
that the instruction fetching logic of a BHT is replaced 
with a much simpler mechanism that can only generate 
"next sequential" fetches. 

Most of the components in a processor with a DHT
have functions similar to those found in a processor with
a BHT. However, there are some important differences.
The cache 13 again supplies instruction-fetch
information to the instruction-buffer 11. The instruction-buffer
(IB) 11 holds the instruction-fetch segments that have
been returned from the cache 13. There can be multiple
instruction buffers. Each separate IB is used, as in
any multi-instruction stream processor, to hold restart
information (to restart the pipeline if the prediction
made by the DHT is wrong), instruction-fetch
information (instruction-fetch segments for each instruction
stream currently in the pipeline) and alternate
instruction path information. For example, a detailed
description of a multi-instruction stream multi-instruction
buffer processor can be found in U.S. Pat. No.
4,200,927. The next-instruction register 16 is loaded
from the instruction buffers. The instruction-buffers use
the next-instruction address, which is input to the
update-instruction-counter (update-IC) 27, to determine
which instruction to load into the
next-instruction-register 16.

The instruction decode logic 17 produces the follow- 
ing outputs: 

provides the execution unit with execution informa- 
tion about the instruction. 

provides the address generate function with operand 
information (base register, index register, and dis- 
placement values). 

provides the address adder 25 with instruction length 
information. 

The address-adder 25 then combines the instruction- 
length value with the instruction-counter value 26 to 
produce the next-instruction address. This value is then 
saved in the update-instruction-counter 27. Operand 
addresses are computed in the address-generate unit 19, 
and the fetch requests and addresses are sent to the 
cache 13. Operands are returned to the execution unit 
18 via path 21. There, the instruction is executed and the 
results put away. The Endop unit 23 then signals the 
completion of the instruction. 

Consider now that the decoder 17 detects an uncon- 
ditional branch. The target-address of the branch 
(which is an instruction) is fetched from the cache 13 
and the target instruction is loaded into the next-instruc- 
tion register 16 via path 21 and gate 51. The target- 
address of the branch is made available from the address 
generate-mechanism 19. The target-address is also sent 
to the update-IC 27, and the sequential-prefetching-con- 
trol (SPC) 52 via paths 53 and 54, respectively. The 
SPC 52 will then begin to fetch instruction segments 
starting with the "new" target-address of the branch. 




If the decoder 17 detects a conditional branch then
the DHT 55 will be accessed via path 91 to predict the
outcome of the branch, either taken or not taken. The
DHT 55 uses a subset of the address bits, contained in
the instruction-counter 26, to examine the DHT array
contained in the DHT 55. If the prediction of the
branch is "taken", then the target-address is fetched
in the same way as the target-address for the unconditional
branch. That is, the target-address is fetched from the
cache 13 and the target instruction is loaded into the
next-instruction register 16 via gate 51. The
target-address is also sent to the update-IC 27 via path 53, and
the DHT 55 signals the sequential-prefetching-control
52 via path 56 to reset its instruction-fetch address to the
predicted-target-address of the branch to begin new
next-sequential fetches. The target-address of the
branch is sent to the SPC 52 via path 54.

Instruction-fetch logic is contained in the
sequential-prefetching-control (SPC) 52. FIG. 5 gives a more
detailed description of this mechanism. Here, the
address of the last instruction-fetch is saved in the
last-segment-fetched register 61 and whenever the instruction
buffers signal the SPC 52 that space is available for
another instruction-fetch, via gate 62, the
next-sequential instruction-fetch-address is generated and sent to
the cache 13. Whenever an unconditional branch or a
predicted-taken conditional branch is decoded and the
SPC 52 is signaled via path 56 to reset its
instruction-fetch-address to the target-address of the branch
instruction, the last-segment-fetched register 61 is loaded
with the target-address of the branch. This address is
provided by the address-generate function 19.
Thereafter, all next sequential fetches can be generated from
within the SPC 52. The target-address is sent to the
SPC 52 via path 54 and loaded through gate 63 on a
signal from the DHT 55, via path 56, that an
unconditional branch or predicted taken conditional branch was
decoded.
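
The behavior of the SPC can be illustrated with a small model.
This is a sketch under assumptions, not the circuit of FIG. 5:
the 16-byte segment size, the alignment of sequential fetches to
segment boundaries, and the names `SequentialPrefetchControl`,
`next_fetch`, and `redirect` are all hypothetical.

```python
SEGMENT_BYTES = 16  # assumed instruction-fetch segment size (a quadword)

class SequentialPrefetchControl:
    """Toy model of the SPC 52 and its last-segment-fetched register 61."""

    def __init__(self, start_address):
        self.last_segment_fetched = start_address

    def next_fetch(self, buffer_has_space):
        # Generate a next-sequential fetch only when the buffers have room.
        if not buffer_has_space:
            return None
        next_segment = (self.last_segment_fetched // SEGMENT_BYTES + 1) * SEGMENT_BYTES
        self.last_segment_fetched = next_segment
        return next_segment

    def redirect(self, target_address):
        # An unconditional or predicted-taken branch was decoded:
        # load the register with the branch target.
        self.last_segment_fetched = target_address
```

After `redirect(0x20A)`, the next sequential fetch in this model is
the segment at 0x210, the one following the segment that holds the
target.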

Branch predictions are made by the DHT 55. FIG. 6
gives a more detailed description of a DHT. As
mentioned above, only conditional branches are predicted
by the DHT since all unconditional branches can be
correctly predicted once they are decoded. The DHT is
signaled from the decode-function 17, via path 91,
whenever a conditional branch is decoded. The DHT
array 71 is accessed by using a subset of the bits that
represent the instruction address, held in the
instruction-counter register 26. Each array entry consists of
only a single bit whose value represents the outcome of
a branch at this memory location, up to the size of the
table. For example, if the entry examined in the DHT
array 71 is a one (a DHT hit) then the branch is guessed
taken, or if the value found is a zero (a DHT miss) then
the branch is guessed not taken. Here, the terms
"DHT-hit" and "DHT-miss" correspond to the terms used for
a BHT, i.e., a "BHT-hit" and "BHT-miss", and denote
if the branch is predicted as taken or not-taken,
respectively.

DHT correction information is sent to the DHT from
the execution unit 18, via path 22. The correction
information contains the results of the branch's execution,
either taken or not taken, and the address of the branch.
The DHT is updated using this information. Since the
DHT only predicts the action of each conditional
branch, only the execution results of each branch need
to be returned to the DHT.
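
A single-bit table indexed by a subset of the branch address bits,
with lookup at decode time and update after execution, can be
sketched as follows. The table size, the choice of which address
bits form the index, and the names `DecodeHistoryTable`,
`predict_taken`, and `update` are assumptions for illustration
only; the text does not specify them.

```python
class DecodeHistoryTable:
    """Toy one-bit-per-entry DHT indexed by low-order branch address bits."""

    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.array = [0] * (1 << index_bits)   # 0 = not taken, 1 = taken

    def _index(self, branch_addr):
        # Assumption: drop bit 0 because instructions start on halfword
        # boundaries, then keep the low-order index bits.
        return (branch_addr >> 1) & self.mask

    def predict_taken(self, branch_addr):
        return self.array[self._index(branch_addr)] == 1   # one = DHT hit

    def update(self, branch_addr, taken):
        # Correction from the execution unit: record the actual outcome.
        self.array[self._index(branch_addr)] = 1 if taken else 0
```

Because only a subset of the address bits is used, two branches can
share an entry. In a 16-entry table built this way, addresses 0x40
and 0x60 map to the same bit.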

A general description of the operation of a processor
with a BHT or DHT has been presented. FIGS. 7A and
7B show sets of rules in tabular form that summarize
the events and actions that occur in these processors.
The events are listed according to decode-time and
instruction-fetch time actions, and BHT and DHT hit or
miss results. For example, FIG. 7A shows that a
"BHT-hit" (a prediction of taken) causes the instruction
fetching mechanism to switch to the target-address of
the branch (at instruction-fetch time). However, on a
BHT-miss, the instruction fetching mechanism will
continue to fetch instructions down the next sequential
path. During the decode phase of the pipeline, the
decoder will switch to the target-address stream as
identified by the BHT-hit or continue to decode down the
next sequential path if no BHT entry was found (a
BHT-miss).

The actions taken for a processor with a DHT are 
different, as shown in FIG. 7B. During the instruction 
fetch phases, the instruction-fetch mechanism can only 
fetch the next-sequential instructions regardless of any 
prediction outcome. However, if a branch is discovered 
at decode time and predicted as taken (a DHT hit) then 
the instruction fetching mechanism will begin to fetch 
instruction segments down the target-address path. Sim- 
ilarly, the decoder will switch to decode instructions 
down the target-address path when a DHT hit occurs. 
If the prediction is a DHT miss, then the processor will 
continue to fetch instructions down the next-sequential 
path and also decode the fall-through instruction on the 
next cycle. 

Next a specific embodiment for the current invention, 
a multi-prediction branch prediction mechanism, is 
presented. This mechanism uses both a BHT and DHT 
to predict branches and in so doing improves the pre- 
diction process by offering features that neither predic- 
tion mechanism (DHT/BHT) can provide separately. 
These extra features will then improve overall branch 
prediction accuracy and thus improve performance. 
FIG. 8 shows a table that summarizes the events and 
actions that describe the current invention. Each BHT-hit
is now divided into two categories: an "active-taken-hit"
and a "ghost-hit".

An active-taken-hit is used to describe what was
previously called a BHT-hit. That is, the instruction fetch
address matches the branch address saved in the BHT
and the branch is currently taken. However, a
"ghost-hit" describes when the instruction-fetch address
matches a branch address saved in the BHT but the
branch is no longer taken. FIG. 9 is used to explain
these differences in detail. The figure shows the format
for the BHT directory, array, and block entries. Each
directory entry 81 represents the address of an
instruction-fetch segment that contains at least one previously
executed taken branch. The array information 82,
associated with each directory entry, identifies (a) the
address of each taken branch contained within an
instruction-fetch segment, (b) the target-address of each
branch, (c) a taken, not-taken bit, (d) a valid bit, and (e)
LRU usage bits.

Consider a taken branch contained in an instruction
fetch segment with address X. It will have a value of X
in the BHT directory 83 and corresponding segment
information 84. Note that the directory entry represents
the address of an instruction-fetch segment, and so
multiple taken branches can exist within each
segment. Recall that each instruction-fetch-segment is
usually several words long and thus contains several
instructions. The array-segment-entry information then
contains information identifying each taken branch
found within each instruction-fetch segment. FIG. 9
shows that each segment entry can identify up to four
taken branches within an instruction-fetch segment.
This is more than adequate for instruction-fetch
segments of a quad word (16 bytes). Each sub-segment
entry contains information identifying the branch
address within the instruction-fetch segment. This field
needs to be only three bits wide to identify an
instruction within a quadword segment in the IBM S/370
architecture since instructions can only begin on each
halfword boundary. The target-address must be a full
instruction address since a branch can branch to any
instruction-fetch segment. However, it is noted that the
target-address can be abbreviated or truncated when
stored in the BHT. It is possible to save only a subset of
the low order bits that make up the full target-address.
The full target-address can then be reconstructed by
concatenating the current high order bits from the
instruction-counter register to the low order bits saved in
the BHT. This technique tries to take advantage of the
observation that branches typically jump a short
distance and very little accuracy will be lost in generating
the full target-address from only a reduced set of
address bits.
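
The truncation and reconstruction of a target-address can be shown
with simple bit arithmetic. The sketch below assumes 12 saved
low-order bits; that width, and the function names
`truncate_target` and `reconstruct_target`, are hypothetical and
chosen only to make the example concrete.

```python
LOW_BITS = 12                       # assumed number of low-order bits saved
LOW_MASK = (1 << LOW_BITS) - 1

def truncate_target(target_addr):
    """Keep only the low-order bits that would be stored in the BHT."""
    return target_addr & LOW_MASK

def reconstruct_target(instruction_counter, saved_low_bits):
    """Concatenate the current high-order bits with the saved low-order bits."""
    return (instruction_counter & ~LOW_MASK) | saved_low_bits
```

A branch at 0x45A10 with target 0x45F00 is reconstructed exactly,
since both share the same high-order bits. A target such as 0x99F00
would be reconstructed as 0x45F00, which is the accuracy loss the
text accepts for long-distance branches.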

The valid bit is used to denote which of the BA/TA pairs
in each block, up to four, are valid. As mentioned
above, provisions are made to remember up to four
branch addresses within each instruction-fetch segment.

The taken bit is needed because a branch can change
its actions. That is, initially a branch must be taken to be
entered into the BHT. However, in subsequent
executions the branch can fail to be taken. This action would
result in a BHT correction. Recall that a BHT
correction occurs for all branches that are predicted as taken
and then fail to be taken when executed. The correction
mechanism must then "turn off" the taken bit for this
BHT entry. If an instruction fetch segment matches a
branch address with a "taken bit" turned off, then a
ghost-hit is identified. Ghost hits are used to describe a
hit to a BHT entry (a branch) that is no longer taken.

The usage bits are needed because an instruction- 
fetch segment can contain more than four taken 
branches. If this occurs, then a replacement mechanism 
will be used to keep the most currently referenced 
branch addresses and discard the branch address that
was referenced the furthest time in the past. Typically, 
a least-recently-used (LRU) algorithm or form of LRU 
algorithm is used as the replacement algorithm. 
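
A least-recently-used policy over the four branch slots of one
segment entry might look like the sketch below. It is one possible
form of LRU, not the patent's usage-bit encoding: the class
`SegmentEntry`, its `reference` method, and the use of an ordered
dictionary in place of hardware usage bits are assumptions.

```python
from collections import OrderedDict

MAX_BRANCHES = 4   # each segment entry remembers up to four branches

class SegmentEntry:
    """Toy model of one BHT segment entry with LRU replacement."""

    def __init__(self):
        # Maps branch sub-address -> target-address; order tracks recency,
        # oldest (least recently used) first.
        self.branches = OrderedDict()

    def reference(self, sub_addr, target_addr):
        if sub_addr in self.branches:
            self.branches.move_to_end(sub_addr)      # mark as most recent
        elif len(self.branches) >= MAX_BRANCHES:
            self.branches.popitem(last=False)        # discard the LRU branch
        self.branches[sub_addr] = target_addr
```

If branches at sub-addresses 0, 1, 2, and 3 are entered, branch 0 is
referenced again, and a fifth branch at sub-address 4 arrives, the
branch at sub-address 1 is the one discarded.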

Returning now to FIG. 8, we see that each BHT
prediction (hit, ghost-hit, miss) is now paired with a
DHT prediction (hit or miss). The events are listed
according to the instruction-fetch and decode-time
phases of the processor.

For example, a valid-BHT-hit causes the instruction
fetch mechanism to fetch the target of the branch found
in the BHT. This occurs regardless of the DHT
prediction, hit or miss. At decode time, the processor will
switch to decode instructions from the target-address
path. Instructions from the target-address path will be
decoded by the processor for either a DHT hit or miss.

However, each "ghost-hit" prediction (at
instruction-fetch time) causes the instruction-fetch mechanism to
continue to fetch instructions from the next-sequential
path. During the decode phase of the pipeline, the
processor will continue to decode down the next-sequential
path regardless of the DHT prediction, a hit or a miss.

For a BHT-miss, the processor will continue to fetch
the next-sequential segments. At decode time, if the
DHT also predicts that the branch is not taken, then the
processor will continue to decode instructions from the 
next-sequential path. However, if the initial prediction 
made by the BHT was a miss (not taken) and the DHT 
predicts that the branch will be taken, a DHT hit at 
decode time, then the processor will stop fetching the 
next-sequential path and switch to fetch instructions 
from the target-address path. Note that the target- 
address of the branch is available from the address gen- 
erate function. When the target-address instructions are 
available in the instruction-buffer, the DECODER will 
also switch to decode instructions from the target ad- 
dress path identified by the branch. 
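
The rules just described for pairing a BHT result with a DHT
result can be written as a small decision function. This is a
sketch of the rules as stated in the text, not a reproduction of
FIG. 8 itself. The function name `fig8_policy` and the string
labels used for paths and BHT results are hypothetical.

```python
def fig8_policy(bht_result, dht_hit):
    """Return (fetch_time_path, decode_time_path) for one branch.

    bht_result is "taken-hit", "ghost-hit", or "miss"; dht_hit is a bool.
    """
    if bht_result == "taken-hit":
        # Fetch and decode the target regardless of the DHT.
        return ("target", "target")
    if bht_result == "ghost-hit":
        # Stay on the next-sequential path regardless of the DHT.
        return ("next-sequential", "next-sequential")
    if bht_result == "miss":
        # Sequential fetch first; a DHT hit at decode time redirects
        # fetching and decoding to the target-address path.
        decode_path = "target" if dht_hit else "next-sequential"
        return ("next-sequential", decode_path)
    raise ValueError(f"unknown BHT result: {bht_result!r}")
```

Only the combination of a BHT-miss with a DHT hit lets the
decode-time prediction override the instruction-fetch-time one.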

FIG. 10 shows the features of a processor according 
to the preferred embodiment of the invention that uses 
both a DHT and BHT for branch prediction. Most of 
the features in the figure have functions similar to those
described for a processor with just a BHT (FIG. 2) or a 
processor with a DHT (FIG. 4). For example, the cache 
13 supplies instructions to the instruction-buffer 11 as 
well as operands to the execution unit 18. The next- 
instruction-register 16 contains the next instruction to 
be processed by the instruction-decode-mechanism 17.
The BHT 12 performs branch prediction, controls in- 
struction fetching and directs the flow of instructions 
through the pipeline. The BHT 12 signals the instruc- 
tion-buffer 11 to load the next-instruction-register 16 
with the appropriate next instruction, either the next- 
sequential-instruction or branch-target-instruction. The 
next-instruction fetch is generated by the BHT 12 on a 
signal from the instruction buffer 11 that space is avail- 
able. The next-instruction fetch can be either the next- 
sequential address or the target-address of a branch 
instruction, depending on the results of the previous 
instruction-fetch, a BHT-hit or BHT-miss.

The operations of the DHT 55 are similar to the
operation of a DHT described in FIG. 4. That is, the
DHT 55 predicts conditional branches during the
decode phase of the pipeline and signals the branch
prediction results, either a hit or miss, to the BHT via path 56.
Along with the branch prediction results, the address of
the branch as well as the target-address of the branch are
also supplied via path 56. This information will be used
by the instruction-fetching logic in the BHT 12 to verify
that the branch-predictions made by the BHT (made
during the instruction-fetch phase of the pipeline) are
correct. However, it is noted that the target-address of
the branch is already supplied to the BHT 12 via path 28
and need not be duplicated via path 56. By including
this information in the signal on path 56, the
pipeline-restart and instruction-fetching-logic are simplified.

The DHT 55 is signaled via path 91 from the instruc- 
tion-decode-function 17 whenever a conditional branch 
is decoded. The address of the branch is supplied to the 
DHT 55 from the instruction-counter register 26. BHT 
and DHT corrections are signaled via path 22 from the 
execution unit 18. The address-adder mechanism 25 
takes as its input the address of the current instruction 
being decoded, found in the IC register 26, and the 
length of the instruction being decoded and computes 
the next-instruction address. The new address is saved 
in the update-IC register 27. The DHT prediction logic 
for determining if a branch is taken or not-taken is the 
same as described in FIG. 6. This information is com- 
bined with the branch prediction logic found in the 
BHT to provide an improved branch scheme that neither
a BHT nor a DHT can provide alone.

Several changes are made to the BHT as described in
FIG. 3. FIG. 11 shows in detail how the correct
BA/TA pair is selected from the BHT entries as
described in FIG. 9. A "BHT-hit" is the result of an
instruction-fetch-address 31, in FIG. 3, matching the
address of an instruction-fetch-segment saved in the BHT
82, of FIG. 9. The comparative logic is contained in
function 102. The instruction-fetch-address used in the
comparison can be divided into two parts, 103 and 104.
Part 103 is used to compare against each
instruction-fetch-segment saved in the BHT. This portion of the
address need only specify the address of an
instruction-fetch down to a double-word or quad-word, depending
on the width of the address-fetch bus used to fetch
instructions from the cache or memory. Part 104
represents a sub-address (SA) portion of the instruction-fetch
address and identifies which instruction within an
instruction-fetch-segment caused the BHT entry. This
instruction would be a taken branch. This part of the
address is not immediately needed in detecting a
"BHT-hit". It is only used after an initial match of the
instruction-fetch-address and the instruction-fetch-segment
saved in the BHT. After a "BHT-hit" is detected, the
correct branch within the instruction-fetch segment
must be selected. Recall, there can be several taken
branches within each instruction-fetch-segment.

The desired branch is identified as the closest branch 
encountered after the instruction fetch address. The 
appropriate branch address is selected by the select 
logic 105 according to the following rules:

Subtract the SA field from each BA field in the iden- 
tified BHT-segment. 

Disregard negative and invalid differences. 

Select the BA with the smallest or zero difference. 
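
The three selection rules above can be expressed directly in code.
The sketch below is an illustrative software analogue of the select
logic, not the hardware of function 105. The function name
`select_branch` and the representation of each entry as a
`(ba, valid)` pair are assumptions.

```python
def select_branch(sub_address, entries):
    """Pick the closest valid branch at or after the fetch sub-address.

    entries is a list of (ba, valid) pairs for one BHT segment.
    Returns the index of the selected entry, or None if none qualifies.
    """
    best = None
    for index, (ba, valid) in enumerate(entries):
        difference = ba - sub_address          # rule 1: subtract SA from BA
        if not valid or difference < 0:        # rule 2: skip negative/invalid
            continue
        if best is None or difference < best[0]:
            best = (difference, index)         # rule 3: smallest or zero wins
    return None if best is None else best[1]
```

With entries at sub-addresses 1, 5, 3 (invalid), and 6, a fetch
starting at sub-address 2 selects the branch at 5, since the entry
at 1 lies before the fetch address and the entry at 3 is invalid.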

The full BA is then created in the select gate logic
106. The selected BA 104 value from the select logic
105 is appended to the instruction-fetch-address 103 to
create the full branch address. This address is then
paired with the appropriate TA to form a BA/TA pair.
The BA/TA pair represents a "ghost-hit" if the
taken-bit is zero, T=0, or a valid-taken-hit if T=1. The
selected BA/TA pair is then saved in the BA/TA stack 37
of FIG. 3 to guide the later selection of the
"next-instruction" from the instruction-buffer 11 into the
next-instruction-register 16 of FIG. 2. The selected
target-address is also gated to two other locations. First, the
TA is used to update the next-instruction-fetch-address
31 of FIG. 3. Second, the target-address is used to reset
the next-sequential-instruction-fetch control 34 of FIG.
3.

FIG. 12 gives a detailed description of the prediction 
logic that uses a DHT and BHT. Parts of the logic are 
similar to the prediction logic described in FIG. 3. 
Again, each instruction-fetch address, contained in 
next-instruction fetch-address 31, is compared against 
each instruction-fetch-segment contained in the BHT. 
The comparative logic is contained in function 33. 
When a match is detected, a "BHT-hit", the appropriate
BA/TA address is saved in the BA/TA stack 37. Both
valid-taken-hits (the taken-bit T=1) and ghost-hits
(T=0) are saved in the BA/TA stack. These BA/TA 
address pairs are used to guide the loading of the next- 
instruction-register 16 from the instruction-buffer 11 
with the appropriate next-instruction. 

If the "BHT-hit" is a valid-taken-hit (T=1), then the
TA will also be sent to the instruction-fetch-register 31
and to the next-sequential-fetch mechanism 34. The
selected TA becomes the next-instruction-fetch on all
valid-taken-hits and is also used to reset the
next-sequential-fetch controls 34.

The prediction results of the DHT, provided via path
56, and the prediction made by the BHT, provided by
the BA/TA stack 37, are examined in function 111. The
BA of the oldest entry in the BA/TA stack 37 is
compared to the address of the branch just decoded. If these
addresses are not equal then the processor has decoded
a branch that the BHT has predicted as not taken, a
BHT-miss. The processing logic then proceeds to
function 112 where the prediction made by the DHT is
examined. If the DHT predicted that the branch will be
taken, then it is assumed that the prediction made by the
BHT is incorrect and the correct prediction is the
DHT's. This causes the pipeline to be restarted,
function 113, and the instruction-fetch and decode-time
policies for a BHT-miss and DHT-hit described in FIG.
8 are enforced. The target-address of the branch just
decoded becomes the next-instruction-fetch address,
and the pipeline proceeds down the target-address path.
Recall that the DHT can remember the outcome of a
larger number of branches than a BHT and still be
smaller in size. Thus, the prediction of branch not taken,
a BHT-miss, can be the result of a taken branch that has
aged out of a small BHT and is still retained in a large
DHT.

If the prediction made by the DHT is not-taken, then 
function 43 is signaled to load the instruction following 
the branch as the next-instruction loaded from the in- 
struction-buffers to the next-instruction-register 16. 

Returning to the logic of function 111, if the BA from 
the BA/TA stack 37 equals the address of the branch 
being decoded, then the BHT-hit policies of FIG. 8 are 
implemented. If the branch is identified as a
valid-taken-hit (T=1), then function 42 is signaled to use the
target-address instruction as the next-instruction loaded from
the instruction-buffer 11. If the BHT entry identifies a 
ghost-hit (T=0), then function 43 signals the next- 
instruction loaded from the instruction-buffers is the 
next-sequential or fall-through instruction. 
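
The comparison performed at decode time, in which the oldest
BA/TA stack entry is matched against the address of the decoded
branch and the taken bit is then examined, can be modeled as a
short classification routine. This is a hedged sketch: the
function name `classify_at_decode` and the representation of stack
entries as `(ba, ta, taken_bit)` tuples are assumptions, and the
removal of entries from the stack is not modeled.

```python
from collections import deque

def classify_at_decode(ba_ta_stack, decoded_branch_addr):
    """Classify the BHT's view of the branch that was just decoded.

    ba_ta_stack holds (ba, ta, taken_bit) tuples, oldest entry first.
    """
    if not ba_ta_stack:
        return "miss"                       # nothing was predicted
    ba, _ta, taken_bit = ba_ta_stack[0]     # oldest entry in the stack
    if ba != decoded_branch_addr:
        return "miss"                       # BHT did not predict this branch
    return "valid-taken-hit" if taken_bit == 1 else "ghost-hit"
```

A result of "miss" is the case in which the DHT prediction is then
consulted; the other two results are resolved by the BHT entry
alone.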

Function 41 is again used to compare the next- 
instruction address, returned via path 28 against the BA 
of the oldest entry in the BA/TA stack 37. This guides 
the loading of instructions from the instruction-buffers 
into the next-instruction-register 16. Function 40 com- 
pares the target-address of a branch, calculated by the 
address-generate-function 19 of FIG. 2, with the target- 
address of the branch saved in the BA/TA stack 37. If 
the saved target-address is different from the generated
target-address at decode, then the pipeline must be
restarted. The restart information is sent via path 45. BHT
corrections are sent to the correction-handling-stack 47. 
Here, the execution results of each branch, either taken 
or not-taken, are checked with the predicted results and 
updates, new-entries, or branch deletions are scheduled 
to the BHT via update path 46 when necessary.

Finally, FIG. 13 gives a detailed description of the 
comparative logic used to examine the predictions made 
by the BHT and DHT. Function 111 compares the 
address of the branch just decoded, sent via path 56, 
with the BA of the oldest entry in the BA/TA stack 37. 
If the addresses are equal, identifying a BHT-hit, then 
the taken, not-taken bit is examined in function 122. The 
output of function 122 determines if the selected branch 
is a valid-taken-hit or a ghost-hit. If the comparative
logic of 122 determines that the branch address from the 
BA/TA stack 37 is not equal to the address of the
decoded branch then a BHT-miss is identified.

The select gates, 123 to 128, determine which of the
instruction-fetch/decode-time policies to follow as
described in FIG. 8. Each gate represents an AND gate
and the output is sent to the following units (in FIG. 12):

Unit 43 to decode the next-sequential instruction as 
the next instruction. 

Unit 42 to decode the target-address of the branch as 
the next instruction. 

Unit 113 to restart the pipeline. This causes the target- 
address of the branch to be fetched and then de- 
coded as the next instruction. 

In the embodiment described above, each branch is 
predicted twice, once during the instruction-fetch phase 
of the pipeline and then again at decode time. The em- 
bodiment then uses a set of rules that determine the 
prediction whenever the two independent branch pre- 
diction policies disagree. More specifically, the two 
branch prediction schemes described in the embodi- 
ment are a BHT and a DHT. The set of rules to arbitrate 
any branch prediction differences are described in FIG. 
8. The BHT and DHT are a preferred choice in branch 
prediction policies because each separately has a very 
high success rate in predicting branches, and when used 
together, an even higher percentage of branches can be 
guessed successfully. 

It would be a simple matter, however, to substitute 
one of the other branch prediction schemes as men- 
tioned in the prior art for either the BHT or the DHT 
prediction scheme and still maintain the spirit of this 
invention. For example, branch prediction can be ac- 
cording to op-code type as described in U.S. Pat. Nos. 
4,181,942 and 4,200,927. Either of these can be substi- 
tuted for the BHT and DHT. Then, a set of rules, simi- 
lar to the set of rules described in FIG. 8, is needed to 
resolve differences when the branch prediction schemes 
do not agree. In fact, a third branch prediction policy 
can be used to arbitrate those cases when the first two 
branch prediction schemes do not agree. 

For example, if the predictions made by the BHT and 
DHT do not agree, then another branch prediction 
scheme (possibly predicting each branch by its op-code 
type) can be used to arbitrate these differences. Alterna- 
tively, a multi-prediction branch prediction mechanism 
can consist of three separate branch prediction policies 
with the actual prediction of each branch being deter- 
mined by an agreement of any two. Those skilled in the 
art will recognize that a different branch prediction 
scheme can be substituted for those described for the 
preferred embodiment of the invention, or the existing 
branch prediction schemes can be modified and the 
invention can still be practiced, with modification, 
within the spirit and scope of the appended claims. 
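
The two three-predictor variants mentioned above, a third scheme
acting as arbiter and a two-out-of-three agreement, can be
compared with a brief sketch. Predictions are modeled as booleans
(True for taken), and the function names `arbitrated` and
`majority_of_three` are hypothetical.

```python
def arbitrated(first, second, arbiter):
    """Use the arbiter's prediction only when the first two disagree."""
    return first if first == second else arbiter

def majority_of_three(a, b, c):
    """Predict taken when at least two of the three predictors say taken."""
    return (int(a) + int(b) + int(c)) >= 2
```

For simple taken/not-taken predictions these two formulations give
the same answer in every case, since whenever the first two
predictors disagree the third one casts the deciding vote.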

Having thus described our invention, what we claim 
as new and desire to secure by Letters Patent is as fol- 
lows: 

1. A multi-prediction branch prediction mechanism 
for predicting the outcome of branch instructions in a 
computer having a pipeline processor having a plurality
of different processing stages and including an instruction
buffer connected to a memory for temporarily
storing fetched instructions, an instruction decode cir- 
cuit connected to said instruction buffer and decoding 
instructions temporarily stored in said instruction 
buffer, an address generate circuit responsive to said 
decoded instructions from said instruction decode cir- 
cuit addressing said memory to fetch instructions to said 
instruction buffer, and an execution unit responsive to 
decoded instructions from said instruction decode circuit
for performing operations in accordance with decoded
instructions, said multi-branch prediction mechanism
comprising:
at least two, independent branch prediction means 
connected to said pipeline processor for indepen- 
dently predicting a branch instruction at different 
stages of said plurality of stages of said pipeline, 
said different stages corresponding to different 
positions of said pipeline and each having a time of 
processing different from one another; 
means responsive to said two, independent branch 
prediction means for resolving those instances 
when predictions from each of said two, indepen- 
dent branch prediction means differ; and 
means responsive to said execution unit for updating 
said two, independent branch prediction means 
based on execution of said branch instruction. 

2. The multi-prediction branch prediction mechanism 
recited in claim 1 wherein one or more of said at least 
two, independent branch prediction means comprise an 
instruction-fetch branch prediction means connected to 
said instruction buffer and to said address generate 
mechanism for generating an initial branch prediction 
based on a limited history of branches taken and supply- 
ing to said address generate mechanism a target address. 

3. The multi-prediction branch prediction mechanism 
recited in claim 1 wherein one or more of said at least 
two, independent branch prediction means comprise a 
decode-time branch prediction means connected to said 
instruction decode mechanism and to said instruction- 
fetch branch prediction mechanism for generating a 
branch prediction based on a history of branches exe- 
cuted. 

4. The multi-prediction branch prediction mechanism 
recited in claim 1 wherein said at least two, independent 
branch prediction means comprise: 

an instruction-fetch branch prediction means con- 
nected to said instruction buffer and to said address 
generate mechanism for generating an initial 
branch prediction based on a limited history of 
branches taken and supplying to said address gen- 
erate mechanism a target address; and 

a decode-time branch prediction means connected to 
said instruction decode mechanism and to said 
instruction-fetch branch prediction mechanism for 
generating a branch prediction based on a history 
of branches executed. 

5. The multi-prediction branch prediction mechanism 
recited in claim 4 wherein said instruction-fetch branch 
prediction means comprises a branch history table 
(BHT) in which are stored a set of recently executed 
branches followed by their target-addresses and said 
decode-time branch prediction means comprises a de- 
code history table (DHT) in which are stored records of 
actions of each branch. 

6. A multi-prediction branch prediction mechanism 
according to claim 1, wherein said at least two, indepen- 
dent branch prediction means comprise first and second 
branch prediction means, said first branch prediction 
means being of a type different from that of said second 
branch prediction means. 

7. A multi-prediction branch prediction mechanism 
for predicting the outcome of branch instructions in a 
computer having a pipeline processor including an in- 
struction buffer connected to a memory for temporarily 
storing fetched instructions, an instruction decode cir- 
cuit connected to said instruction buffer and decoding 
instructions temporarily stored in said instruction 
buffer, an address generate circuit responsive to said 
decoded instructions from said instruction decode cir- 
cuit addressing said memory to fetch instructions to said 
instruction buffer, and an execution unit responsive to 
decoded instructions from said instruction decode cir- 
cuit for performing operations in accordance with de- 
coded instructions, said multi-branch prediction mecha- 
nism comprising: 
at least two, independent branch prediction means 
connected to said pipeline processor for indepen- 
dently predicting branch instructions at different 
stages of said pipeline, said at least two, indepen- 
dent branch prediction means comprising an in- 
struction-fetch branch prediction means connected 
to said instruction buffer and to said address gener- 
ate circuit for generating an initial branch predic- 
tion based on a limited history of branches taken 
and supplying to said address generate circuit a 
target address, and a decode-time branch prediction 
means connected to said instruction decode 
circuit and to said instruction-fetch branch prediction 
means for generating a branch prediction 
based on a history of branches executed; 
means responsive to said two, independent branch 
prediction means for resolving those instances 
when prediction from each of said two, independent 
branch prediction means differ, said means 
responsive to said branch prediction means comprising 
comparing means for comparing said initial 
branch prediction from said instruction-fetch 
branch prediction means and said branch prediction 
from said decode-time branch prediction 
means, and selection means responsive to said comparing 
means for accepting said initial branch prediction 
from said instruction-fetch branch prediction 
means when a match occurs between said 
initial branch prediction and said branch prediction 
from said decode-time branch prediction means but 
overriding said initial branch prediction by selecting 
said branch prediction from said decode-time 
branch prediction means and restarting said pipeline 
when said initial branch prediction and said 
branch prediction from said decode-time branch 
prediction means do not match; and 
means responsive to said execution unit for updating 
said two, independent branch prediction means 
based on execution of a branch instruction. 

8. The multi-prediction branch prediction mechanism 
recited in claim 7 wherein said instruction-fetch branch 
prediction means comprises a branch history table 
(BHT) in which are stored a set of recently executed 
branches followed by their target-addresses and said 
decode-time branch prediction means comprises a decode 
history table (DHT) in which are stored records of 
actions of each branch. 

9. The multi-prediction branch prediction mechanism 
recited in claim 8 wherein said means responsive to said 
execution unit for updating said two, independent 
branch prediction means based on execution of a branch 
instruction comprises a correction handling circuit connected 
to said execution unit and said branch history 
table and said decode history table to correct information 
stored in said tables based on execution of a branch 
instruction. 

* * * * * 